unsafe concept
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
Recent advancements in vision-language-to-image (VL2I) diffusion generation have made significant progress. While generating images from broad vision-language inputs holds promise, it also raises concerns about potential misuse, such as copying artistic styles without permission, which could have legal and social consequences.
ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection
Ma, Ruize, Cai, Minghong, Jiang, Yilei, Han, Jiaming, Feng, Yi, Tan, Yingshui, Zhu, Xiaoyong, Zhang, Bo, Zheng, Bo, Yue, Xiangyu
Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.
- North America > United States (0.14)
- Europe > Austria > Vienna (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (2 more...)
- Law (1.00)
- Information Technology (0.93)
Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities
Qu, Yiting, Backes, Michael, Zhang, Yang
Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images due to their internal ethical standards and powerful reasoning abilities. However, it is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. To address this, we first compile the UnsafeConcepts dataset, featuring 75 unsafe concepts, i.e., ``Swastika,'' ``Sexual Harassment,'' and ``Assaults,'' along with associated 1.5K images. We then conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We assess eight popular VLMs and find that, although most VLMs accurately perceive unsafe concepts, they sometimes mistakenly classify these concepts as safe. We also identify a consistent modality gap among open-source VLMs in distinguishing between visual and textual unsafe concepts. To bridge this gap, we introduce a simplified reinforcement learning (RL)-based approach using proximal policy optimization (PPO) to strengthen the ability to identify unsafe concepts from images. Our approach uses reward scores based directly on VLM responses, bypassing the need for collecting human-annotated preference data to train a new reward model. Experimental results show that our approach effectively enhances VLM alignment on images while preserving general capabilities. It outperforms baselines such as supervised fine-tuning (SFT) and direct preference optimization (DPO). We hope our dataset, evaluation findings, and proposed alignment solution contribute to the community's efforts in advancing safe VLMs.
- Law > Criminal Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.48)
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?
Yang, Hongzheng, Chen, Yongqiang, Qin, Zeyu, Liu, Tongliang, Xiao, Chaowei, Zhang, Kun, Han, Bo
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refractors the training data with an explicit reasoning process, which firstly identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates, and meanwhile maintaining strong performance on regular tasks such as math and code generation.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Europe > Middle East > Malta > Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
Recent advancements in vision-language-to-image (VL2I) diffusion generation have made significant progress. While generating images from broad vision-language inputs holds promise, it also raises concerns about potential misuse, such as copying artistic styles without permission, which could have legal and social consequences. To address these issues, researchers have explored various approaches, such as dataset filtering, adversarial perturbations, machine unlearning, and inference-time refusals. However, these methods often lack either scalability or effectiveness. In response, we propose a new framework called causal representation editing (CRE), which extends representation editing from large language models (LLMs) to diffusion-based models. CRE enhances the efficiency and flexibility of safe content generation by intervening at diffusion timesteps causally linked to unsafe concepts.
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
Yoon, Jaehong, Yu, Shoubin, Patil, Vaidehi, Yao, Huaxiu, Bansal, Mohit
Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful or undesirable concepts (e.g., artist styles) without additional training. To address these challenges, we propose SAFREE, a novel, training-free approach for safe text-to-image and video generation, that does not alter the model's weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt token embeddings away from this subspace, thereby filtering out harmful content while preserving intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. By integrating filtering across both textual embedding and visual latent spaces, SAFREE ensures coherent safety checking, preserving the fidelity, quality, and safety of the generated outputs. Empirically, SAFREE achieves state-of-the-art performance in suppressing unsafe content in T2I generation (reducing it by 22% across 5 datasets) compared to other training-free methods and effectively filters targeted concepts, e.g., specific artist styles, while maintaining high-quality output. It also shows competitive results against training-based methods. We further extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. As generative AI rapidly evolves, SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation. Content warning: this paper contains content that may be inappropriate or offensive, such as violence, sexually explicit content, and negative stereotypes and actions. Generation tools such as DALL E 3, Midjourney, Sora, and KLING have seen significant growth, enabling a wide range of applications in digital art, AR/VR, and educational content creation.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.46)
SteerDiff: Steering towards Safe Text-to-Image Diffusion Models
Zhang, Hongxiang, He, Yifeng, Chen, Hao
Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack researches further underscore the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
- North America > United States > California > Yolo County > Davis (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Ma, Jiachen, Cao, Anda, Xiao, Zhiqing, Zhang, Jie, Ye, Chao, Zhao, Junbo
Text-to-Image (T2I) models have received widespread attention due to their remarkable generation capabilities. However, concerns have been raised about the ethical implications of the models in generating Not Safe for Work (NSFW) images because NSFW images may cause discomfort to people or be used for illegal purposes. To mitigate the generation of such images, T2I models deploy various types of safety checkers. However, they still cannot completely prevent the generation of NSFW images. In this paper, we propose the Jailbreak Prompt Attack (JPA) - an automatic attack framework. We aim to maintain prompts that bypass safety checkers while preserving the semantics of the original images. Specifically, we aim to find prompts that can bypass safety checkers because of the robustness of the text space. Our evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline defenses safety checkers to generate NSFW images.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Africa (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)